(54) Multilingual pronunciations for speech recognition



(57) There is provided a novel approach for generating multilingual text-to-phoneme mappings for use in
multilingual speech recognition systems. The multilingual mappings are based on the weighted outputs from
a neural network text-to-phoneme model, trained on data mixed from several languages. The multilingual
mappings, used together with a branched grammar decoding scheme, are able to capture both inter- and
intra-language pronunciation variations, which is ideal for multilingual speaker independent speech
recognition systems. A significant improvement in overall system performance is obtained for a multilingual
speaker independent name dialling task when applying multilingual instead of language dependent
text-to-phoneme mapping.
Description



[0001] The invention relates to a method of speech recognition comprising receiving an acoustic input and a speech
recognition apparatus comprising transducer means for receiving an acoustic input and processing means.

[0002] Speaker independent command word recognition and name dialling on portable devices such as mobile
phones and personal digital assistants have attracted significant interest recently. A speech recognition system provides
an alternative to keypad input for limited size portable products. The speaker independence makes the system
particularly attractive from a user point of view compared to speaker dependent systems. For large vocabularies, user
training of a speaker dependent recogniser is likely to become too tedious to be useful.

[0003] How to build acoustic models that integrate multiple languages in automatic speech recognition applications
is described by F. Palou, P. Bravetti, O. Emem, V. Fischer, and E. Janke, in the publication "Towards a Common Phone
Alphabet for Multilingual Speech Recognition", in Proceedings of ICSLP, pages 1-1, 2000.

[0004] An architecture for embedded multilingual speech recognition systems is proposed by O. Viiki, I. Kiss and
J. Tian, in the publication "Speaker- and Language-Independent Speech Recognition in Mobile Communication Systems",
in Proceedings of ICASSP, 2001.

[0005] Use of neural networks for TTP giving estimates of the posterior probabilities of the different phonemes for
each letter input is taught by K. Jensen and S. Riis, in the publication "Self-Organizing Letter Code-Book for
Text-To-Phoneme Neural Network Model", published in Proceedings of ICSLP, 2000.

[0006] A phoneme-based speaker independent system is ready to use "out-of-the-box" and does not require any
training session of the speaker. Furthermore, if the phoneme based recogniser is combined with a text-to-phoneme
(TTP) mapping module for generating phoneme pronunciations online from written text, the user may define specific
vocabularies as required in e.g. a name dialling application. Naturally, speaker and vocabulary independence comes
at a cost, namely increased complexity for real-time decoding, increased requirements for model storage and usually
also a slight drop in recognition performance compared to speaker dependent systems. Furthermore, speaker
independent systems typically contain a number of language dependent modules, e.g. language dependent acoustic
phoneme models, TTP modules etc. For portable devices, the support of several languages may be prohibited by the
limited memory available in such devices, as separate modules need to be stored for each language.

[0007] Recently, systems based on multilingual acoustic phoneme models have emerged (see the letters written
by Palou et al. and Viiki et al. mentioned above). These systems are designed to handle several different languages
simultaneously and are based on the observation that many phonemes are shared among different languages. The
basic idea in multilingual acoustic modelling is to estimate the parameters of a particular phoneme model using speech
data from all supported languages that include this phoneme. Multilingual speech recognition is very attractive as it
makes a particular speech recognition application usable by a much wider audience. In addition, the logistic needs are
reduced when making world wide products. Furthermore, sharing of phoneme models across languages can
significantly reduce memory requirements compared to using separate models for each language. Multilingual recognisers
are thus very attractive for portable platforms with limited resources.

[0008] Even though multilingual acoustic modelling has proven efficient, user definable vocabularies typically still
require language dependent TTP modules for each supported language. Prior to running the language dependent TTP
module it is furthermore necessary to first identify the language ID of each vocabulary entry.

Language Dependent text-to-phoneme (TTP) Mapping 

[0009] For applications like speaker independent name dialling on mobile phones the vocabulary entries are typically
names in the phonebook database 21 that may be changed at any time. Thus, for a multilingual speaker independent
name dialler 22 to work with language dependent TTP, a language identification module (LID) 30 is needed. An example
of a multilingual speech recognition system according to prior art using a LID module 30 is shown in Figure 3. In Figure
3, it is shown how the LID module 30 selects a language dependent TTP module 31.1-31.n that is used for generating
the pronunciation by means of a pronunciation lexicon module 32 for a multilingual recogniser 33 based on multilingual
acoustic phoneme models.

[0010] The LID module 30 may be a statistical model predicting the language ID of each entry in the vocabulary, a
deterministic module that sets the language ID of each entry based on application specific knowledge, a module that
simply requires the user to set the language ID manually, or a combination of these. In the most general case, a priori
knowledge about the language ID is not available and manual language identification by the user may not be desirable.
In that case, language identification must be based on a statistical LID module that predicts the language ID from the
written text.

[0011] Depending on the application, the TTP module 31.1-31.n may be a statistical model, a rule based model, a
model based on a lookup table that contains all possible words, or any combination of these. The lookup table approach
will typically not be possible for name dialling applications on portable devices with limited memory resources due to
the large number of possible names.

[0012] In most applications based on user defined vocabularies, a statistical LID module 30 has very limited text
data for deciding the language ID of an entry. For a short name like "Peter", for example, only five letters are available
for language identification. Furthermore, many names are not unique to a single language but rather used in a large
number of languages with different pronunciations. In addition to this, a speaker may pronounce a foreign/non-native
name with a significant accent, i.e., the pronunciation of the name is actually a mixture of the pronunciation
corresponding to the language from which the name originates and the native language of the speaker.

[0013] This implies that the combination of language dependent TTP modules 31.1-31.n, a statistical LID module 30
and multilingual acoustic phoneme models is likely to give a poor overall performance. Furthermore, if several
languages are to be supported in a portable device, the size of the LID and TTP modules may have to be severely limited
in order to fit into the low memory resources of the device. For "irregular" languages, like English, high accuracy TTP
modules may take up as much as 40-300 kb of memory, whereas TTP modules for rule based "regular" languages
like Japanese and Finnish typically require less than 1 kb.

[0014] A method according to the present invention is characterised by generating sequences of multilingual
phoneme symbols based on a semantic code or code sequence, generating reference speech signals in dependence on
said sequences of multilingual phoneme symbols and comparing said reference speech signals with the acoustic input
in order to find a match.

[0015] A speech recognition apparatus according to the present invention is characterised in that the processing
means is configured for generating sequences of multilingual phoneme symbols based on a semantic code or code
sequence, generating reference speech signals in dependence on said sequences of multilingual phoneme symbols
and comparing said reference speech signals with the acoustic input in order to find a match.

[0016] The semantic code or code sequence may represent characters of a known writing system, such as Roman,
Cyrillic and Korean alphabets, Japanese syllabaries and Chinese ideographs.

[0017] Where the semantic code or code sequence represents one or more characters of an alphabet or a syllabary,
it is preferable that the code or code sequence be processed character by character and a neural network provide an
estimate of the posterior probabilities of the different phonemes or phoneme sequences for each character.

[0018] Preferably, a user input is received, a semantic code or code sequence is generated in dependence on said
user input and stored in a store, and the stored code or code sequence is retrieved for generation of said reference
speech signals.

[0019] Preferably, the neural network comprises a standard fully connected feed-forward multi-layer perceptron
neural network.

[0020] An apparatus according to the present invention may be advantageously incorporated into a communication
terminal, for example a mobile phone.

[0021] In the case of a communication terminal, the store comprises data forming an electronic phonebook including
phone numbers and associated name labels.

[0022] An embodiment of the present invention will now be described, by way of example only, with reference to the
accompanying drawings, in which:

Figure 1 schematically illustrates a preferred embodiment of a hand portable phone according to the invention.

Figure 2 schematically shows the essential parts of a telephone for communication with e.g. a cellular network.

Figure 3 shows as a block diagram a multilingual speech recognition system employing a LID module and language
specific TTP modules according to prior art.

Figure 4 shows as a block diagram a multilingual speech recognition system employing a Multilingual TTP module
according to a preferred embodiment of the invention.

Figure 5 shows a branching diagram according to the preferred embodiment of the invention for pronunciation of the
name "Peter" in German (p-e:-t-6), English (p-i:-t-@) and Spanish (p-i-t-e-r) arranged as a branched grammar.
The values on the arcs between phonemes indicate probabilities of the phonemes as provided by e.g. the TTP
module. SAMPA notation is used for phonemes.

[0023] Figure 1 shows a preferred embodiment of a phone according to the invention, and it will be seen that the
phone, which is generally designated by 1, comprises a user interface having a keypad 2, a display 3, an on/off button
4, a speaker 5 (only openings are shown), and a microphone 6 (only openings are shown). The phone 1 according to
the preferred embodiment is adapted for communication preferably via a cellular network, e.g. GSM.

[0024] According to the preferred embodiment the keypad 2 has a first group 7 of keys as alphanumeric keys, two
soft keys 8, and a four way navigation key 10. Furthermore the keypad includes two call-handling keys 9 for initiating
and terminating calls. The present functionality of the soft keys 8 is shown in a separate field in the bottom of the display
3 just above the keys 8. This key layout is characteristic of e.g. the Nokia 6210™ phone.

EP 1 291 848 A2

[0025] Figure 2 schematically shows the most important parts of a preferred embodiment of the phone, said parts
being essential to the understanding of the invention. A processor 18, which inter alia supports the GSM terminal
[...] the transmitter/receiver circuit 19 and an antenna 20.

[...] the signals formed thereby are A/D converted in an A/D converter (not shown) before the speech is encoded
in an audio part 14. The encoded speech [...] transferred to the processor 18. The processor 18 also forms the
interface to [...] ROM memory 17b, a SIM card 16, the display 3 and the keypad 2 (as well as data, power supply,
etc.). The SIM card 16 includes an electronic phonebook database 21 containing name labels and associated phone
numbers. The audio [...] via a D/A converter (not shown).

[0027] A multilingual speaker independent name dialler 22 receives an audio input from the microphone 6 via the
[...]. The multilingual speaker independent name dialler 22 compares the audio input with the text [...]

Multilingual TTP Mapping

[0028] According to the invention a single TTP module 34 is used for mapping text directly into pronunciations based
on a common multilingual phoneme set in the pronunciation lexicon module 32. That is, the multilingual TTP module
34 outputs a sequence of multilingual phoneme symbols based on written text as input, see Figure 4. This sequence
[...] generating the pronunciation by means of a pronunciation lexicon module 32 for a multilingual recogniser 33
based on multilingual acoustic phoneme models.

[...] several languages, e.g. the letter "p" maps to the phoneme "p" (SAMPA notation) in most contexts for both
English, German, Finnish and [...] symbols for such letters. The alternative phonemes are then used to create a
branched grammar. The principle of branched grammar decoding in combination with a multilingual TTP module 34 is
illustrated for the name "Peter" in Figure 5. The name "Peter" is pronounced quite differently in different languages,
but using a multilingual TTP module 34 along with branched grammar decoding allows for capturing all pronunciations
that are [...] in the set of languages supported by the multilingual recognition system. The different phonemes at
each position can be weighted [...]

[0031] It should be noted at this point that when a Viterbi decoding scheme is used, only one of the possible
phonemes [...] all-path forward decoder [...] all phonemes at a given position give a weighted contribution to the
score of the word, see Figure 5. Thus, a forward decoding scheme will in principle allow pronunciations that are a
mixture of several [...] pronunciations.
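
The difference between the two decoding schemes can be illustrated with a minimal sketch. It is not the decoder of the embodiment: time alignment is ignored, each position is scored independently, and the grammar, weights and acoustic likelihoods below are hypothetical values chosen only for illustration.

```python
# Hypothetical branched grammar: each position lists (phoneme, TTP weight)
# alternatives. Symbols and weights are illustrative, not taken from Figure 5.
grammar = [
    [("p", 1.0)],
    [("e:", 0.6), ("i:", 0.4)],
    [("t", 1.0)],
]

# Hypothetical acoustic likelihood of each phoneme for the spoken input.
acoustic = {"p": 0.9, "e:": 0.2, "i:": 0.5, "t": 0.8}


def viterbi_score(grammar, acoustic):
    """Best-path score: only the best alternative counts at each position."""
    score = 1.0
    for alternatives in grammar:
        score *= max(weight * acoustic[ph] for ph, weight in alternatives)
    return score


def forward_score(grammar, acoustic):
    """All-path score: every alternative gives a weighted contribution."""
    score = 1.0
    for alternatives in grammar:
        score *= sum(weight * acoustic[ph] for ph, weight in alternatives)
    return score
```

Because the alternatives are independent per position in this simplification, the product of per-position sums equals the sum over all paths through the grammar, which is why the forward score can reflect a pronunciation that mixes branches.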

Experiments 

[0032] [...] and UK), and Spanish, were used for the design and evaluation of the overall system. For these four
languages the total number of mono-phonemes is 133, corresponding to 39 phonemes for English, 28 for Spanish, [...].
[...] multilingual phoneme set sharing similar phonemes, the total number of mono-phonemes can be reduced to 67
without affecting overall system performance. For testing of the overall system recognition rate, an in-house test
performed by the assignee of the present patent [...] on a 120 word vocabulary of names [...] full names [...]. The
total number of test utterances was 21900: 5038 for UK English, 5283 for Spanish, 7979 for German and 3600 for
Finnish.

[0033] A short description of the TTP, LID and multilingual acoustic model architecture, set-up, and training is provided.
TTP mapping module 

[0034] Four different approaches for TTP mapping have been considered in this work: 
Table 1: TTP training set sizes, model architectures, and model sizes. The number of input, hidden, and output
units in the fully connected MLPs are denoted by I, H, and O respectively.

Language    No. words    I x H x O         Model size
English     197 277      243 x 104 x 46    30 kb
German      255 280      217 x 38 x 47     10 kb
Spanish     36 486       102 x 5 x 32      0.7 kb
Finnish     15 103       72 x 4 x 25       0.4 kb
ML          882 915      333 x 99 x 73     40 kb

[0035] TrueTTP: true (handmade) phoneme transcriptions. This represents the "ideal" case where a lexicon covering
all possible words used in the application is available.

[0036] noLID: language specific TTP modules assuming that the language ID of each word in the vocabulary is
known a priori. The language ID can e.g. be set manually by the user or based on specific knowledge about the
application.

[0037] LID: language specific TTP modules in combination with a statistical LID module for setting the language ID
of each vocabulary word. Note that for vocabulary entries composed of several words (e.g. first and last name) the
language ID is set separately for each word.

[0038] ML-TTP: multilingual TTP.

[0039] Instead of using a LID module 30 or assuming that the language ID is known beforehand, a pronunciation for
each supported language could be generated for each word. Similarly, for some applications it makes sense to include
pronunciations not only for the language selected by the LID module but also for languages known a priori to be very
likely for the particular application. Such methods may, however, lead to a significant increase in real-time decoding
complexity as the active vocabulary is "artificially" increased, especially when many languages are supported by
the system.

[0040] There are several possible strategies for statistical TTP models including e.g. decision trees and neural
networks. In this work, standard fully connected feed-forward multi-layer perceptron (MLP) neural networks have been
chosen for the TTP module. The TTP networks take a symmetrical window of letters as input and give a probability
for the different phonemes for the central letter in the window. At each position in the window, the letter is encoded as
an orthogonal binary vector in order to avoid introducing artificial correlations between letters.
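
The input encoding described above can be sketched as follows. This is an illustration only: the alphabet, the padding symbol `#` and the window half-width `context` are assumptions and are not specified in the text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed letter inventory
PAD = "#"  # hypothetical symbol for window positions outside the word
SYMBOLS = ALPHABET + PAD


def one_hot(symbol):
    """Orthogonal binary vector: a single 1 at the symbol's index."""
    vec = [0] * len(SYMBOLS)
    vec[SYMBOLS.index(symbol)] = 1
    return vec


def encode_window(word, center, context=2):
    """Concatenate one-hot codes for a symmetric window around word[center]."""
    padded = PAD * context + word.lower() + PAD * context
    window = padded[center : center + 2 * context + 1]
    features = []
    for symbol in window:
        features.extend(one_hot(symbol))
    return features
```

For the first letter of "peter" with `context=2`, the window is `##pet` and the resulting vector has one active unit per window position. Since every pair of distinct one-hot codes is orthogonal, no letter is made artificially "closer" to another by the encoding.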

[0041] All neural network TTP modules were designed to take up roughly the same amount of memory. Thus, the
four language dependent TTP models use a total of 40 kb of memory (with 8 bit/parameter precision), which is the
same amount used by the ML-TTP model. The language dependent TTP modules were trained by standard
back-propagation using language specific lexicons and the ML-TTP module was trained on a balanced multilingual lexicon
containing roughly equal amounts of data from each of the four languages. The size of the training databases, the
architecture and the size of the TTP networks are given in Table 1. All training material was taken from the following
pronunciation lexicons: BeepDic (UK), CmuDic (US), LDC-CallHome-German, LDC-CallHome-Spanish,
SpeechDat-Car-Finnish transcriptions.
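
The model sizes in Table 1 can be checked with a short parameter count. The sketch below assumes one bias per hidden and output unit, which is not stated in the text; the architectures are the I x H x O values from Table 1.

```python
def mlp_size_bytes(n_in, n_hidden, n_out, bits_per_param=8):
    """Storage for a fully connected I x H x O MLP at a given precision."""
    weights = n_in * n_hidden + n_hidden * n_out
    biases = n_hidden + n_out  # assumption: one bias per non-input unit
    return (weights + biases) * bits_per_param // 8


# I x H x O architectures as listed in Table 1 (last entry is the ML model).
architectures = [
    (243, 104, 46),
    (217, 38, 47),
    (102, 5, 32),
    (72, 4, 25),
    (333, 99, 73),
]

sizes = [mlp_size_bytes(*arch) for arch in architectures]
```

Under this assumption the counts come to 30206, 10117, 707, 417 and 40366 bytes, which is consistent with the 30 kb, 10 kb, 0.7 kb, 0.4 kb and 40 kb quoted in Table 1.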

[0042] For all TTP methods except the TrueTTP method both single pronunciations (no branching) and branched
grammars have been tested. For the branched grammars, the number of branches at each position was hard limited
to a maximum of 5 and 70% of the phoneme posterior probability mass was included at each position. With this scheme,
the real-time decoding complexity is increased by 10-25%, corresponding to an increase of 10-25% in the number of
phonemes compared to a lexicon without branching. The largest increase of 25% was observed for the ML-TTP
approach. Due to pronunciation variation among different languages, a larger number of branches is needed on average
at each position in order to include 70% of the posterior probability mass.
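
One way to read the branch selection rule above is sketched here. How the two limits interact (stop as soon as 70% of the mass is covered, or when 5 branches have been taken, whichever comes first) is an assumption, and the function and parameter names are hypothetical.

```python
def select_branches(posteriors, mass=0.70, max_branches=5):
    """Pick the most probable phonemes at one position.

    posteriors: dict mapping phoneme symbol -> posterior probability.
    Phonemes are added in order of decreasing probability until the
    requested probability mass is covered or the branch limit is hit.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered = [], 0.0
    for phoneme, prob in ranked:
        if len(chosen) >= max_branches or covered >= mass:
            break
        chosen.append((phoneme, prob))
        covered += prob
    return chosen
```

A letter with a sharply peaked posterior thus yields a single branch, while an ambiguous letter yields several, which matches the observation that the ML-TTP approach needs more branches on average.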


LID module 

[0043] As for the TTP model there are several possible choices for the statistical LID module, e.g. N-grams [1],
decision trees [3], and neural networks. In a set of initial experiments, a neural network based LID module was found
to have a very good generalization ability for LID classification from short segments of text even for very compact
network sizes. Consequently, a standard fully connected feed-forward MLP network was selected for LID classification.
The LID neural network takes a symmetrical window of letters as input and gives probabilities for each of the possible
languages at the output for the central letter in the window. The overall language probabilities for a given word were
computed as a geometrical average over the language probabilities for all letters in the word. The language
for a word was selected as the one with the largest overall probability.

[0044] [...] 333 inputs was trained by standard back-propagation on a balanced set of 50 317 words, roughly
12 500 words from each of the four languages. All words were picked at random from the pronunciation lexicons used
for training the TTP modules. With 8 bit/parameter precision, the [...]
Multilingual acoustic module 

[0045] [...] based on a [...] system known as Hidden Neural Networks (HNN) [...]. The basic idea in the HNN
architecture is to replace the Gaussian mixtures in each state of an HMM by state specific [...]. [...] HNNs were
discriminatively trained on 6965 Spanish, 9234 German, 5611 Finnish, 6300 US English and 9880 UK [...]
Spanish-VAHA, SpeechDat-AustrianGerman, SpeechDat[...]. [...] associated with a state in a [...] phoneme HNN was
selected based on the following heuristics: for phonemes used by a single language zero hidden units are used. For
phonemes shared by two or more languages the number of hidden units is equal to the number of languages sharing
the phoneme. With 8 bit/parameter precision this results in a total size of 17 kb for the acoustic [...]

[0046] Before training, the utterances were mixed with 3 different types of noise (car, cafe, music) at SNRs in the
range 5-20 dB in order to increase noise robustness of the acoustic models.
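
Mixing an utterance with noise at a chosen SNR is commonly done by scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below shows that standard calculation; it is not taken from the patent, and the function name and list-based signal representation are assumptions.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the resulting SNR equals snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")
    # Gain g chosen so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Drawing `snr_db` from the 5-20 dB range for each training utterance would reproduce the kind of variation described above.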

[0047] [...] passed through an MFCC preprocessor yielding 13 static, delta [...]. [...] dimensional feature vector
was normalized to zero [...] corresponding to the log energy were normalized to unit variance.

[...] applied. This has been observed to yield a better performance [...] decoding scheme is more appropriate for
branched grammar as described above.

Table 2: Word recognition rates [...] a multilingual speaker independent name dialling application. A single
transcription is used for all entries in the vocabulary.

Single               True    noLID    LID     ML
UK English (5038)    92.6    87.8     79.7    82.2
German (7979)        95.2    92.2     86.0    87.4
Spanish (5283)       95.4    94.3     91.8    92.3
Finnish (3600)       99.1    98.9     98.5    98.3
Average              95.6    93.3     89.0    90.1
Table 3: Word recognition rates [...] a multilingual speaker independent name dialling application. Branched
grammars are used during decoding.

Branched             True    noLID    LID     ML
UK English (5038)    92.6    91.6     81.4    85.8
German (7979)        95.2    93.5     88.7    92.5
Spanish (5283)       95.4    94.5     93.0    96.2
Finnish (3600)       99.1    98.9     98.7    98.8
Average              95.6    94.6     90.5    93.3
[0049] Even though the acoustic models have been trained using data from the above mentioned languages, they
have been observed to give good performance for many other languages based on phonemes contained in the 67
multilingual phoneme set. A similar observation has been made for an HMM based multilingual system built on the same
set of 67 multilingual phonemes.

Results. 

[0050] The results presented below give the overall speech recognition performance when using the TTP methods
described above for transcribing text in combination with the multilingual HNN speech recogniser.

[0051] Table 2 shows the performance when using the different TTP methods for each of the four test languages
when a single pronunciation is used for each entry in the vocabulary (no branched grammar decoding). The column
"True" shows the performance obtained using hand-transcriptions, "noLID" the performance obtained assuming
known language ID of each entry and language specific TTP modules, and "LID" the performance obtained when using
a LID module in combination with language dependent TTP modules. The last column in the table shows the
performance of a system employing a multilingual TTP module.

[0052] As seen from Table 2, the true transcriptions are clearly superior for all languages. However, for applications
intended for portable devices, there is usually not room for storing a complete dictionary, so the TTP mappings must
be based on a more compact statistical method. Comparing the columns entitled "noLID" and "LID", it is evident that the
errors in language ID introduced by the statistical LID module seriously hamper the recognition performance. Thus,
even with a LID module that gives a fairly accurate language identification, the performance of the overall system is
seriously affected by incorrect language identification for a few words.

[0053] Table 3 illustrates the effect of applying the different statistical TTP modules with a branched grammar
decoding scheme. As seen, all statistical methods gain from branched grammar decoding (the average word recognition
rate rises from 93.3 to 94.6 for noLID, from 89.0 to 90.5 for LID and from 90.1 to 93.3 for ML-TTP), and for the Spanish
test the ML-TTP module in combination with branched grammar decoding even outperforms the true TTP transcriptions.
This indicates that an ML-TTP model allows for more variation in the pronunciation of words and thereby increases
recognition performance. Interestingly, the multilingual TTP model is capable of giving almost the same performance
on average for the four languages as the combination of manually set language ID and language dependent TTP (noLID).

[0054] Table 4 illustrates the average performance over the four languages when testing in various noise
environments with different TTP mapping methods. As can be seen, the gain due to branched grammar decoding and
multilingual TTP is maintained in noise.



Table 4: Average performance over Finnish, German, Spanish, and English for various TTP methods in a multilingual
speaker independent name dialling application in various noise environments at 10 dB SNR. Branched grammars were
applied during decoding.

Noise type    True    noLID    LID     ML
Clean         95.6    94.6     90.5    93.3
VolvoSO       92.3    90.6     85.6    88.6
Babble        87.6    84.8     80.0    82.6
Pop Music     83.2    79.8     74.5    77.0



[0055] According to the invention there is provided a novel approach for generating multilingual text-to-phoneme
mappings for use in multilingual speech recognition systems. The approach is based on the ability of the statistical
TTP module to generate a weighting of the phonemes at each position. Here we have used the phoneme posterior
probabilities provided by a feed-forward neural network, trained on data mixed from several languages.

[0056] The set of weighted phonemes at each position is used to create a branched grammar, using all-path forward
decoding. The branched grammar allows for capturing of both inter- and intra-language pronunciation variations.

[0057] Tests showed that the multilingual TTP approach along with the branched grammar offers a significant
improvement in recognition performance in multilingual phoneme based speaker independent speech recognition
compared to using language dependent TTP modules in combination with a LID module. In some cases the multilingual
TTP approach even improved recognition performance compared to when the language ID was known a priori.
Claims

1. A method of speech recognition comprising receiving an acoustic input and characterised by generating
sequences of multilingual phoneme symbols based on a semantic code or code sequence, generating reference speech
signals in dependence on said sequences of multilingual phoneme symbols and comparing said reference speech
signals with the acoustic input in order to find a match.

2. A method according to claim 1, wherein the semantic code or code sequence represents one or more characters
of an alphabet or a syllabary.

3. A method according to claim 2, wherein said code or code sequence is processed character by character and a
neural network provides an estimate of the posterior probabilities of the different phonemes or phoneme sequences
for each character.

4. A method according to claim 1, 2 or 3, comprising receiving a user input, generating a semantic code or code
sequence in dependence on said user input, storing the generated code or code sequence in a store and retrieving
said stored code or code sequence for generation of said reference speech signals.

5. A speech recognition apparatus comprising transducer means (6) for receiving an acoustic input and processing
means (18, 22), characterised in that the processing means (18, 22) is configured for generating sequences of
multilingual phoneme symbols based on a semantic code or code sequence, generating reference speech signals
in dependence on said sequences of multilingual phoneme symbols and comparing said reference speech signals
with the acoustic input in order to find a match.

6. An apparatus according to claim 5, wherein the semantic code or code sequence represents one or more characters
of an alphabet or a syllabary.

7. An apparatus according to claim 6, wherein said code or code sequence is processed character by character and
a neural network provides an estimate of the posterior probabilities of the different phonemes or phoneme
sequences for each character.

8. An apparatus according to claim 7, wherein the neural network comprises a standard fully connected feed-forward
multi-layer perceptron neural network.

9. An apparatus according to any one of claims 5 to 8, comprising a user input means (2), wherein the processing
means (18, 22) is configured for generating a semantic code or code sequence in dependence on signals from
said user input means (2) and storing the generated code or code sequence in a store (21), and for retrieving said
stored code or code sequence for generation of said reference speech signals.

10. A communication terminal including an apparatus according to any one of claims 5 to 9.

11. A communication terminal according to claim 10, wherein said store (21) comprises data forming an electronic
phonebook including phone numbers and associated name labels.

12. Method of speech recognition in order to identify a speech command as a match to a written text command, and
comprising steps of:

providing a text input from a text database; 
receiving an acoustic input; 

generating sequences of multilingual phoneme symbols based on said text input by means of a multilingual
text-to-phoneme module;

generating pronunciations in response to said sequences of multilingual phoneme symbols; and
comparing said pronunciations with the acoustic input in order to find a match.

13. Method according to claim 12 wherein the text input is processed letter by letter, and wherein a neural network
provides an estimate of the posterior probabilities of the different phonemes for each letter.

14. Method according to claim 13 comprising deriving said text input from a database containing user entered text
strings.

15. System for speech recognition and comprising: 

a text database for providing a text input; 
transducer means for receiving an acoustic input; 

a multilingual text-to-phoneme module for outputting sequences of multilingual phoneme symbols based on
said text input;

a pronunciation lexicon module receiving said sequences of multilingual phoneme symbols from said multilingual
text-to-phoneme module, and for generating pronunciations in response thereto; and
a multilingual recognizer based on multilingual acoustic phoneme models for comparing said pronunciations
generated by the pronunciation lexicon module with the acoustic input in order to find a match.

16. System according to claim 15, wherein the multilingual text-to-phoneme module processes said text input letter
by letter, and comprises a neural network for giving an estimate of the posterior probabilities of the different
phonemes for each letter.

17. System according to claim 16 wherein the neural network is a standard fully connected feed-forward multi-layer
perceptron neural network.

18. System according to claim 15 wherein the text input is derived from a database containing user entered text strings.

19. System according to claim 18 wherein the database containing user entered text strings is an electronic phonebook
including phone numbers and associated name labels.

20. Communication terminal having a speech recognition unit comprising:

a text database for providing a text input; 
transducer means for receiving an acoustic input; 

a multilingual text-to-phoneme module for outputting sequences of multilingual phoneme symbols based on
said text input;

a pronunciation lexicon module receiving said sequences of multilingual phoneme symbols from said multilingual
text-to-phoneme module, and for generating pronunciations in response thereto; and
a multilingual recognizer based on multilingual acoustic phoneme models for comparing said pronunciations
generated by the pronunciation lexicon module with the acoustic input in order to find a match.

21. Communication terminal according to claim 20, wherein the multilingual text-to-phoneme module processes said
text input letter by letter, and comprises a neural network for giving an estimate of the posterior probabilities of the
different phonemes for each letter.

22. Communication terminal according to claim 21 wherein the neural network is a standard fully connected
feed-forward multi-layer perceptron neural network.

23. Communication terminal according to claim 20 wherein the text input is derived from a database containing user
entered text strings.

24. Communication terminal according to claim 23 wherein the database containing user entered text strings is an
electronic phonebook including phone numbers and associated name labels.